2080.5 - Information Paper: Australian Census Longitudinal Dataset, Methodology and Quality Assessment, 2006-2016
ARCHIVED ISSUE Released at 11:30 AM (CANBERRA TIME) 20/03/2019   

3. LINKAGE RESULTS, 2011-2016, 2011 PANEL

At the completion of the linkage process 927,520 (76%) of the 1,221,057 records from the 2011 ACLD Panel sample were linked to a 2016 Census record to create the linked 2011-2016 ACLD file with an estimated false link rate of 1.4%.

All results presented in this publication (unless identified in the relevant table) are based on characteristics from the 2011 ACLD Panel sample and have been confidentialised to prevent the identification of individuals.

Table 1 displays the linkage rate for a range of sub-populations.

TABLE 1 - LINKAGE RATES, By Selected Characteristics

                                            2011 Panel sample  Linked records  Linkage rate
                                                        (no.)           (no.)           (%)

SEX
Male                                                  600 724         450 092          74.9
Female                                                620 334         477 426          77.0

AGE GROUP
0-14                                                  236 383         189 641          80.2
15-19                                                  79 971          57 114          71.4
20-24                                                  82 222          52 044          63.3
25-29                                                  85 198          57 331          67.3
30-39                                                 168 979         127 974          75.7
40-49                                                 172 576         139 142          80.6
50-59                                                 155 652         127 702          82.0
60-69                                                 121 036          99 537          82.2
70-74                                                  40 657          32 211          79.2
75 and over                                            78 384          44 823          57.2

INDIGENOUS STATUS
Non-Indigenous                                      1 171 794         897 076          76.6
Aboriginal                                             29 156          18 515          63.6
Torres Strait Islander                                  1 819           1 174          64.5
Both Aboriginal and Torres Strait Islander              1 243             802          64.6
Not stated                                             17 050           9 948          58.3

STATE/TERRITORY OF USUAL RESIDENCE
New South Wales                                       393 519         298 795          75.9
Victoria                                              304 513         233 623          76.7
Queensland                                            245 366         183 703          74.9
South Australia                                        91 555          71 650          78.3
Western Australia                                     125 449          95 053          75.8
Tasmania                                               28 580          21 831          76.4
Northern Territory                                     11 628           7 240          62.3
Australian Capital Territory                           20 272          15 530          76.6

REMOTE AREAS
Major Cities                                          852 825         651 866          76.4
Inner Regional                                        228 174         174 567          76.5
Outer Regional                                        110 441          82 485          74.7
Remote                                                 16 570          11 462          69.2
Very Remote                                            10 201           6 016          59.0
No Usual Address                                        2 593           1 002          38.6

Total(a)(b)(c)                                      1 221 057         927 520          76.0

    (a) Data presented in the table have been perturbed. As a result, the sum of individual categories may not align with totals.
    (b) Includes Other Territories.
    (c) Includes Migratory areas.



The linkage rates for the 2011-2016 ACLD were relatively consistent across most sub-populations and were in line with expected results. Compared with the overall linkage rate of 76%, the sub-populations which achieved the highest linkage rates were persons:
  • aged 60 to 69 years (82%), followed by 50 to 59 years (82%) and 0 to 14 years (80%);
  • of non-Indigenous origin (77%);
  • who usually lived in South Australia (78%); and
  • who usually lived in major cities (76%) and inner regional areas (77%).

The sub-populations which achieved the lowest linkage rates were persons:
  • aged 20-24 years (63%) and 75 years and over (57%);
  • of Aboriginal (64%), Torres Strait Islander (65%) or both Aboriginal and Torres Strait Islander origin (65%);
  • who usually lived in the Northern Territory (62%); and
  • who usually lived in remote (69%) and very remote areas (59%) or who had no usual address in 2011 (39%).
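
As a minimal sketch of how the rates quoted above follow from Table 1, the linkage rate for a sub-population is its linked record count divided by its 2011 Panel sample count. The counts below are copied from Table 1; the function and variable names are illustrative only and are not part of any ABS tooling.

```python
# (2011 Panel sample, linked records) pairs copied from Table 1.
TABLE_1_COUNTS = {
    "20-24": (82_222, 52_044),
    "60-69": (121_036, 99_537),
    "75 and over": (78_384, 44_823),
    "Northern Territory": (11_628, 7_240),
    "No Usual Address": (2_593, 1_002),
    "Total": (1_221_057, 927_520),
}


def linkage_rate(panel_records: int, linked_records: int) -> float:
    """Return the linkage rate as a percentage, rounded to one decimal place."""
    return round(100 * linked_records / panel_records, 1)


rates = {name: linkage_rate(*counts) for name, counts in TABLE_1_COUNTS.items()}

# List the selected sub-populations from lowest to highest linkage rate.
for name, rate in sorted(rates.items(), key=lambda item: item[1]):
    print(f"{name:<20} {rate:5.1f}%")
```

Because the published counts have been perturbed, a rate recomputed this way can differ from the published rate in the last decimal place for small sub-populations.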

Traditionally, the Census Post Enumeration Survey (PES) has shown that the Census has higher rates of undercount for people of Aboriginal and/or Torres Strait Islander origin, those aged between 20 and 29, and those in the Northern Territory. As expected, the lower ACLD linkage rates broadly aligned with the same groups that experienced higher levels of undercount in the 2016 Census. One additional group that had lower linkage rates was persons aged 75 and over at the time of the 2011 Census who, due to age, had an increased risk of death over the ensuing five years. Further information on Census undercount can be found in Census of Population and Housing: Details of Overcount and Undercount, 2016 (cat. no. 2940.0).

Further, data cubes demonstrating the linkage rates for various sub-populations are available as an attachment to this Information paper.


3.1 LINKAGE ACCURACY


The following quality measures were calculated for the ACLD and indicate a good level of overall quality:
  1. The linkage rate, being the proportion of the 2011 ACLD Panel records linked to a 2016 Census record, including both true matches and false links.
  2. The estimated proportion of correctly linked records, otherwise referred to as 'linkage precision'.
  3. The consistency of reporting of common information between record pairs.

3.1.1 Linkage Precision

Not all record pairs assigned as links in a data linkage process are a true match, that is, a record pair belonging to the same individual. While the methodology is designed to ensure that the vast majority of links are true, some are actually false, i.e. the records in the link belong to different people rather than the same person. The linkage strategy used for the ACLD was designed to ensure a high level of accuracy while also achieving a sufficiently high number of links to enable longitudinal research. Accordingly, the strategy was restrictive and conservative.

One of the key measures of linkage quality is the proportion of links in the dataset that are false. The number of false links is able to be estimated through the use of methods such as clerically reviewing a sample of links, or by using modelling techniques. Once an estimate of the number of false links is obtained, a 'precision' can be calculated. The precision is an estimate of the proportion of links that are matches (i.e. belonging to the same entity).
Equation: Precision = (Total links - False link estimate)/Total links
Once the precision of the dataset is estimated, the false link rate is easily calculated.

Equation: False link rate = 1 - Precision
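
The two equations can be applied directly. The sketch below is illustrative only: the function names are not from the source, and the false link estimate is simply backed out of the published 1.4% rate rather than taken from the clerical review itself.

```python
def precision(total_links: int, false_link_estimate: float) -> float:
    """Precision = (Total links - False link estimate) / Total links."""
    if total_links <= 0:
        raise ValueError("total_links must be positive")
    return (total_links - false_link_estimate) / total_links


def false_link_rate(precision_estimate: float) -> float:
    """False link rate = 1 - Precision."""
    return 1 - precision_estimate


total_links = 927_520
# Assumption for illustration: false links implied by the published 1.4% rate.
false_links = 0.014 * total_links

p = precision(total_links, false_links)
print(f"precision = {p:.3f}, false link rate = {false_link_rate(p):.3f}")
```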

Precision estimation for the ACLD involved conducting clerical review on a stratified random sample of links. Potential links were stratified by their link weight value, with a minimum of 5% of links sampled from each individual link weight value (after rounding down to the nearest integer). After reviewing the sample, the results were used to calculate precision estimates for links grouped by pass and rounded link weight value. These estimates were then applied to the entire set of linkage results. This provided an estimate of precision for each individual link, which can be referred to as 'marginal precision'. Using the marginal precision, the 'cumulative precision' of the final set of one-to-one links could be estimated.
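
The sampling step can be sketched as follows. This is an illustration under stated assumptions, not the ABS implementation: links are represented as hypothetical (link_id, link_weight) pairs, strata are formed by rounding each link weight down to an integer (one reading of the rounding described above), and at least 5% of each stratum is drawn for review. The reviewer outcomes passed to `marginal_precision` are likewise hypothetical.

```python
import math
import random
from collections import defaultdict


def stratified_review_sample(links, fraction=0.05, seed=0):
    """Draw a random review sample of at least `fraction` of each link weight stratum.

    `links` is an iterable of (link_id, link_weight) pairs.
    Returns a dict mapping stratum (floored link weight) to sampled link ids.
    """
    rng = random.Random(seed)
    strata = defaultdict(list)
    for link_id, weight in links:
        strata[math.floor(weight)].append(link_id)

    sample = {}
    for stratum, ids in strata.items():
        # Round the sample size up so that no stratum falls below the minimum.
        n = max(1, math.ceil(fraction * len(ids)))
        sample[stratum] = rng.sample(ids, n)
    return sample


def marginal_precision(review_outcomes):
    """Share of reviewed links judged to be matches, per stratum.

    `review_outcomes` maps stratum to a list of booleans (True = match).
    """
    return {s: sum(o) / len(o) for s, o in review_outcomes.items() if o}


# Hypothetical links with weights spread between 10.0 and 14.9.
links = [(i, 10 + (i % 50) / 10) for i in range(1000)]
sample = stratified_review_sample(links)
print({stratum: len(ids) for stratum, ids in sorted(sample.items())})
```

The source groups the resulting estimates by pass as well as by rounded link weight; adding the pass number to the stratum key would extend the sketch in that direction.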

After producing both marginal and cumulative precision estimates, a cut-off point was selected. This cut-off is intended to optimise both the number of links and cumulative precision of the links retained above the cut-off point, while at the same time maintaining a high level of marginal precision for every individual link above the cut-off. The marginal precision estimates were used to select the cut-off, with all links with a marginal precision of at least 81% being retained. This resulted in a final file of 927,520 links once the cut-off was applied, with an estimated cumulative precision of 98.6%, or a false link rate of 1.4%, for these links.
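
A hedged sketch of the cut-off step: given groups of links with an estimated marginal precision, keep the groups at or above the threshold and compute the cumulative precision of what remains as the link-weighted average of the marginal precisions. The strata in the example are invented for illustration and do not reproduce the ACLD figures; only the 81% threshold comes from the text.

```python
def apply_cutoff(strata, min_marginal_precision=0.81):
    """Keep strata at or above the marginal precision threshold.

    `strata` is a list of (number_of_links, marginal_precision) pairs.
    Returns (links_kept, cumulative_precision_of_kept_links).
    """
    kept = [(n, p) for n, p in strata if p >= min_marginal_precision]
    links_kept = sum(n for n, _ in kept)
    if links_kept == 0:
        return 0, float("nan")
    expected_true_links = sum(n * p for n, p in kept)
    return links_kept, expected_true_links / links_kept


# Hypothetical strata: (links, marginal precision). The last one falls below the cut-off.
example_strata = [(700, 1.00), (200, 0.95), (60, 0.85), (40, 0.60)]
links_kept, cumulative = apply_cutoff(example_strata)
print(links_kept, round(cumulative, 3))
```

Raising the threshold discards more links but lifts the cumulative precision, which is the trade-off the cut-off is chosen to balance.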

Clerical review relies upon judgment by a well-trained individual. Therefore, while efforts are taken to minimise the risk, it is possible for a link to be incorrectly assigned as a match or non-match. An alternative way of measuring precision is through the use of models. We applied the method of Chipperfield et al. (2018) to provide an independent model-based estimate of the precision. While the clerical estimate of cumulative precision was 98.6%, the model-based approach estimated the precision to be over 99%. The precision as estimated by the clerical review process was retained as the more conservative estimate.

Table 2 provides a summary of the precision estimate and false link rate by the pass where each link was selected (estimated via clerical review).


TABLE 2 - PRECISION ESTIMATES AND FALSE LINK RATES, By Pass Number

Pass Number (a)    Proportion of    Estimated True Link Rate /    Estimated False
                   Overall Links            Precision Estimate          Link Rate
(no.)                        (%)                           (%)                (%)

1                           72.7                           100                  0
2                           15.7                          94.4                5.6
3                            1.2                          96.4                3.6
4                            1.5                          95.3                4.7
5                            0.8                          92.9                7.1
6                            1.1                          99.8                0.2
7                            1.6                          96.2                3.8
8                            1.0                          93.8                6.2
9                            4.4                          95.9                4.1
Total(b)                     100                          98.6                1.4

    (a) Pass number 1 refers to the deterministic linkage.
    (b) Data presented in the table have been unperturbed.
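
As a quick consistency check on Table 2, the overall precision should equal the average of the per-pass precision estimates weighted by each pass's share of links. The sketch below uses the rounded figures as published, so it can only be expected to agree with the Total row to one decimal place.

```python
# (pass number, proportion of overall links %, precision estimate %) from Table 2.
PASSES = [
    (1, 72.7, 100.0),
    (2, 15.7, 94.4),
    (3, 1.2, 96.4),
    (4, 1.5, 95.3),
    (5, 0.8, 92.9),
    (6, 1.1, 99.8),
    (7, 1.6, 96.2),
    (8, 1.0, 93.8),
    (9, 4.4, 95.9),
]

total_share = sum(share for _, share, _ in PASSES)
overall_precision = sum(share * prec for _, share, prec in PASSES) / total_share
overall_false_link_rate = 100 - overall_precision

print(f"overall precision = {overall_precision:.1f}%")
print(f"overall false link rate = {overall_false_link_rate:.1f}%")
```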



The conservative and restrictive nature of the blocking and linking strategy, accompanied by quality controls that were implemented during clerical review, helped to minimise the estimated number of false links throughout the linkage process.

Almost three quarters (73%) of all links were achieved in the first pass of the project, which used a deterministic linking methodology to identify and filter matches. This pass implemented tight geographic and demographic restrictions to maximise the number of high quality links assigned and to limit the number of alternative comparisons required. Using this approach, links were only accepted if a single unique record pair was identified.
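
The "single unique record pair" rule of a deterministic pass can be sketched as below. This is a simplified illustration, not the ACLD procedure: the record layout, the field names (`name_hash`, `sex`, `dob`, `mesh_block`) and the choice of key fields are all assumptions made for the example.

```python
from collections import defaultdict


def deterministic_links(records_2011, records_2016, key_fields):
    """Link records that agree exactly on every key field.

    A link is accepted only when the key identifies exactly one record in
    each file, so ambiguous keys produce no link at all.
    """
    def index(records):
        buckets = defaultdict(list)
        for rec in records:
            key = tuple(rec.get(field) for field in key_fields)
            if None not in key:  # a missing value can never agree
                buckets[key].append(rec["id"])
        return buckets

    idx_2011, idx_2016 = index(records_2011), index(records_2016)
    return [
        (ids[0], idx_2016[key][0])
        for key, ids in idx_2011.items()
        if len(ids) == 1 and len(idx_2016.get(key, ())) == 1
    ]


KEY_FIELDS = ("name_hash", "sex", "dob", "mesh_block")  # hypothetical key
records_2011 = [
    {"id": "A1", "name_hash": "h1", "sex": "F", "dob": "1980-02-01", "mesh_block": "MB1"},
    {"id": "A2", "name_hash": "h2", "sex": "M", "dob": "1975-07-09", "mesh_block": "MB2"},
    {"id": "A3", "name_hash": "h3", "sex": "M", "dob": None, "mesh_block": "MB3"},
]
records_2016 = [
    {"id": "B1", "name_hash": "h1", "sex": "F", "dob": "1980-02-01", "mesh_block": "MB1"},
    {"id": "B2", "name_hash": "h2", "sex": "M", "dob": "1975-07-09", "mesh_block": "MB2"},
    {"id": "B3", "name_hash": "h2", "sex": "M", "dob": "1975-07-09", "mesh_block": "MB2"},
    {"id": "B4", "name_hash": "h3", "sex": "M", "dob": None, "mesh_block": "MB3"},
]

links = deterministic_links(records_2011, records_2016, KEY_FIELDS)
print(links)
```

In the example, A2 has two candidate 2016 records and A3 has a missing date of birth, so neither is linked in this pass; such records would be left for later passes.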

3.1.2 Consistency of Common Information on Record Pairs

In data linkage projects, geographic boundaries function as blocking variables that restrict the search for links to records which agree on the defined geography. They are also used as linking variables, and when combined with other linking fields (such as hashed name, age, sex and date of birth), they provide a high level of uniqueness, and reduce the likelihood of linking to an incorrect record.

Table 3 displays the number of records that had consistent information on key linking variables, grouped by levels of geography.


TABLE 3 - CONSISTENCY OF LINKED RECORDS, By Geography And Selected Linking Fields

Consistency of key linkage fields(a)(b)(c)                                              (no.)     (%)

MESH BLOCK
First name hash, Surname hash, Age exact, Mesh Block, Sex, DOB Day and Month agree    530,305    57.2
First name hash, Surname hash, Age exact, Mesh Block, Sex agree                       160,953    17.4
Age exact, Mesh Block, Sex, DOB Day and Month agree                                    96,202    10.4
Age exact, Mesh Block, Sex agree                                                        7,176     0.8
Age +/- 2 years, Mesh Block, Sex agree                                                 31,223     3.4

STATISTICAL AREA LEVEL 2
First name hash, Surname hash, Age +/- 2 years, SA2, Sex, DOB Day and Month agree      28,767     3.1
Age +/- 2 years, SA2, Sex, DOB Day and Month agree                                      8,677     0.9
Age +/- 2 years, SA2, Sex agree                                                         7,226     0.8

STATISTICAL AREA LEVEL 4
First name hash, Surname hash, Age +/- 2 years, SA4, Sex, DOB Day and Month agree      33,103     3.6
Age +/- 2 years, SA4, Sex, DOB Day and Month agree                                      8,103     0.9

Total records included                                                                911,735    98.3

Total records linked                                                                  927,520     100

    (a) Only includes records that agree on all key linking fields.
    (b) Categories are mutually exclusive. Records that agree in each category are excluded from subsequent categories.
    (c) Percentages may not add up to the total due to rounding.
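
Footnote (b) describes a first-match classification: each linked pair is counted in the first (strictest) category whose fields all agree and is excluded from later ones. A minimal sketch of that logic follows. The category labels abbreviate only a few of the Table 3 rows, and the field names and the set-of-agreeing-fields representation are assumptions made for illustration.

```python
from collections import Counter

# Ordered from most to least strict; only a subset of the Table 3 rows is shown.
CATEGORIES = [
    ("Mesh Block: names, exact age, sex, DOB day and month",
     {"first_name_hash", "surname_hash", "age_exact", "mesh_block", "sex", "dob_day_month"}),
    ("Mesh Block: names, exact age, sex",
     {"first_name_hash", "surname_hash", "age_exact", "mesh_block", "sex"}),
    ("Mesh Block: age within tolerance, sex",
     {"age_within_tolerance", "mesh_block", "sex"}),
    ("SA2: age within tolerance, sex",
     {"age_within_tolerance", "sa2", "sex"}),
]


def classify(agreeing_fields):
    """Return the first category whose required fields all agree, else None."""
    for label, required in CATEGORIES:
        if required <= agreeing_fields:
            return label
    return None


# Hypothetical linked pairs, each described by the set of fields that agree.
pairs = [
    {"first_name_hash", "surname_hash", "age_exact", "age_within_tolerance",
     "mesh_block", "sa2", "sex", "dob_day_month"},
    {"age_within_tolerance", "sa2", "sex"},
    {"sex"},
]
counts = Counter(classify(fields) for fields in pairs)
print(counts)
```

Pairs that fall into no category correspond to the linked records not covered by the "Total records included" row.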


Over 98% of all records that were matched in the ACLD linkage process agreed on small to medium levels of geographic area combined with other key linking fields, such as first name and surname hash codes, age, sex and date of birth. While the number of consistent fields can give a strong indication of likely linkage quality, other factors should be taken into account, for example, the expected number of people in a geographic area that are likely to share a characteristic by chance. A tolerance of plus or minus two years was used at certain parts of the linkage process to cater for persons who may have understated their age in 2011 and/or overstated it in 2016 or vice versa.

By contrast, record pairs may have inconsistent information and yet be a match. Inconsistent information may be recorded for the same person in different Censuses due to a range of factors, including:
  • transcription errors in the Census, where the wrong category is selected or the information is transposed, such as the day the person was born being reported in the month field instead of in the day field;
  • data capture errors, where the Census form is scanned using Optical Character Recognition (OCR) software and certain characters may be mis-classified, such as a 1 captured as a 7 or a 3 as an 8;
  • reporting errors, where information is given for the wrong member of the household (e.g. person 1's information is reported for person 3) or where the person completing the Census form for a household guesses or estimates information about a fellow household member;
  • information that was not stated by the respondent and was imputed as part of Census processing (such as age or sex). While such values were set to missing for linking, the imputed values are included in the analytical dataset;
  • Census form questions being interpreted differently at each Census; or
  • questions being coded differently for each Census.

Of particular note is inconsistency due to non-reporting of name and date of birth. Respondents are becoming less likely to provide their date of birth: 90% reported it in the 2011 Census, decreasing to 81% in the 2016 Census. Further, just over one per cent of Australians had a missing, or blank, response for first name or surname in the 2016 Census. There appeared to be a relationship between having a missing response for both first name and surname and non-response on other variables. Of the people who did not report first name and surname, approximately half did not report at least one of sex, age, or Indigenous status. The vast majority of missing responses came from paper forms, with the overall level of missing responses in the 2016 Census remaining low.


3.2 CHARACTERISTICS OF LINKED AND UNLINKED 2011 ACLD PANEL SAMPLE

The random sample selected from the 2011 Census for the 2011 ACLD Panel was designed to maximise overlap with the 2006 ACLD Panel, while also being representative of the Australian population by age, sex and jurisdiction as well as other characteristics such as Indigenous status and country of birth. The 2011 Panel sample size was increased in comparison to the 2006 Panel sample size primarily due to the increase in the Australian population from 2006 to 2011. The 2011 Panel size was increased slightly to 5.7%, to achieve a linked sample size closer to 5% of the population after allowing for missed links and people no longer being in scope of the ACLD due to death or overseas migration.

Table 4 shows the distribution of key populations across the 2011 Census, the 2011 ACLD Panel sample and the linked results.


TABLE 4 - SELECTED CHARACTERISTICS, By 2011 Census, 2011 ACLD Panel Sample, ACLD Linked Results

                                                    2011 Census  2011 Panel Sample     Linked Results  Weighted Linked Results (a)
                                                   (no.)    (%)       (no.)    (%)       (no.)    (%)                 (no.)    (%)

SEX
Male                                          10 634 012   49.4     600 724   49.2     450 092   48.5            10 440 753   49.5
Female                                        10 873 706   50.6     620 334   50.8     477 426   51.5            10 639 417   50.5

STATE/TERRITORY OF USUAL RESIDENCE
New South Wales                                6 917 656   32.2     393 519   32.2     298 795   32.2             6 787 716   32.2
Victoria                                       5 354 039   24.9     304 513   24.9     233 623   25.2             5 304 805   25.2
Queensland                                     4 332 727   20.2     245 366   20.1     183 703   19.8             4 223 043   20.0
South Australia                                1 596 569    7.4      91 555    7.5      71 650    7.7             1 548 407    7.3
Western Australia                              2 239 171   10.4     125 449   10.3      95 053   10.2             2 182 402   10.4
Tasmania                                         495 351    2.3      28 580    2.3      21 831    2.4               476 403    2.3
Northern Territory                               211 943    1.0      11 628    1.0       7 240    0.8               211 411    1.0
Australian Capital Territory                     357 218    1.7      20 272    1.7      15 530    1.7               343 595    1.6

AGE GROUP
0-9                                            2 772 971   12.9     157 597   12.9     126 844   13.7             2 823 442   13.4
10-19                                          2 776 848   12.9     158 761   13.0     119 912   12.9             2 822 767   13.4
20-29                                          2 973 916   13.8     167 423   13.7     109 375   11.8             3 047 805   14.5
30-39                                          2 973 913   13.8     168 979   13.8     127 974   13.8             2 987 460   14.2
40-49                                          3 047 023   14.2     172 576   14.1     139 142   15.0             3 050 851   14.5
50-59                                          2 744 653   12.8     155 652   12.7     127 702   13.8             2 718 221   12.9
60-69                                          2 125 435    9.9     121 036    9.9      99 537   10.7             2 051 448    9.7
70-79                                          1 253 349    5.8      71 658    5.9      54 430    5.9             1 098 356    5.2
80 and over                                      839 609    3.9      47 387    3.9      22 603    2.4               479 854    2.3

INDIGENOUS STATUS
Non-Indigenous                                19 900 765   92.5   1 171 794   96.0     897 076   96.7            20 228 715   96.0
Aboriginal and/or Torres Strait Islander         548 368    2.5      32 218    2.6      20 491    2.2               617 382    2.9
  Aboriginal                                     495 754    2.3      29 156    2.4      18 515    2.0               558 748    2.7
  Torres Strait Islander                          31 407    0.1       1 819    0.1       1 174    0.1                34 407    0.2
  Both Aboriginal and Torres Strait Islander      21 205    0.1       1 243    0.1         802    0.1                24 227    0.1
Not stated                                     1 058 585    4.9      17 050    1.4       9 948    1.1               233 961    1.1

Total (b)(c)(d)                               21 507 719    100   1 221 057    100     927 520    100            21 080 214    100

    (a) For more information on weighting see chapter 3.4.
    (b) Data presented in the table have been perturbed. As a result the sum of individual categories may not align with totals.
    (c) Includes Other Territories.
    (d) Includes Migratory areas.


The distribution of the ACLD file by sub-population was generally well aligned with both the 2011 Panel sample and the entire 2011 Census. When looking at the relative difference between these proportions, however, some differences are more clearly observed.

Compared with the entire 2011 Census, the linked 2011 ACLD Panel contains relatively more records for people aged 50-59 years, and to a lesser extent those aged 0-9 years, 40-49 years and 60-69 years. By contrast, the linked 2011 Panel contains relatively fewer records for people aged 20-29 years and 80 years and over. This is consistent with the 2006-2011 ACLD linkage as these subpopulations followed similar linkage rates.
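
One way to make the "relative difference" concrete is sketched below. The definition used, (linked share - Census share) / Census share, is an assumption, since the paper does not spell out a formula. The counts for three age groups and the two totals are copied from Table 4.

```python
CENSUS_TOTAL = 21_507_719
LINKED_TOTAL = 927_520

# (2011 Census count, linked count) for selected age groups from Table 4.
COUNTS = {
    "20-29": (2_973_916, 109_375),
    "50-59": (2_744_653, 127_702),
    "80 and over": (839_609, 22_603),
}


def relative_difference(census_count: int, linked_count: int) -> float:
    """Relative difference between the linked share and the Census share."""
    census_share = census_count / CENSUS_TOTAL
    linked_share = linked_count / LINKED_TOTAL
    return (linked_share - census_share) / census_share


rel = {group: relative_difference(*c) for group, c in COUNTS.items()}
for group, value in rel.items():
    print(f"{group:<12} {100 * value:+.1f}%")
```

A positive value means the group is over-represented in the linked file relative to the 2011 Census, and a negative value means it is under-represented.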

In general, the distribution of weighted counts for the linked ACLD file is close to that of the entire 2011 Census, but it should be noted that the weighting process is not designed to produce counts corresponding to the population in 2011. Rather, the weighted population is that of people who were in scope of both the 2011 and 2016 Censuses (see Section 3.4 Weighting). Thus, for example, the lower proportion of older people in the linked file, even after weighting, reflects the impact on the 2011 Panel sample of deaths that occurred between 2011 and 2016.

Further data cubes demonstrating more detailed population distributions are provided as an attachment to this Information paper.


3.3 REASONS FOR UNLINKED RECORDS

There are two main reasons why records from the 2011 Panel sample were not linked to a 2016 Census record:
  1. records belonging to the same individual were present in the 2011 Panel sample and the 2016 Census but these records failed to be linked because they contained missing or inconsistent information; or
  2. there was no 2016 Census record corresponding to the 2011 Panel sample record because the person was not counted in the 2016 Census.

3.3.1 Missing and/or inconsistent information

In these cases, the true match was present in the pool of all record pairs but it was not identified because there was a high level of inconsistency between information on the 2011 ACLD Panel sample record and the 2016 Census record, or key linking fields were missing altogether. The reasons for the match being missed can be categorised into the following groups:
  • the missing or inconsistent information did not allow the record pair to be compared in the same blocking categories and could not be linked;
  • the record pair did not contain enough unique common information to distinguish the match from other potential record pairs;
  • the record pair was linked, but was attributed a low link weight as it contained a lot of missing or inconsistent information and was positioned below the cut-off identified in sample clerical review; or
  • the record pair was subjected to clerical review, but the high level of inconsistency did not enable it to be deemed a true link.

Accurate address coding was crucial in narrowing the search and differentiating between true and false links. It was a particular challenge for persons who had moved, since linkage was then dependent on the information supplied in 2016 about the person's address in 2011. Processing for the 2016 Census involved coding for address five years ago to a fine level of geography, ideally Mesh Block. This was not always possible, due to insufficient and/or incorrect address information being supplied for some persons, potentially due to recall issues.

3.3.2 No 2016 Census record

A person included in the 2011 ACLD Panel sample may have had no equivalent 2016 Census record because they were no longer in scope for the Census due to migration from Australia, or death between 2011 and 2016, or they may have been missed in the 2016 Census.

According to mortality data compiled by the ABS from data supplied by the Registrars of Births, Deaths and Marriages, approximately 913,000 people died in Australia between 2011 and 2016. If 5% of these people were selected in the 2011 Panel sample, then it could be estimated that up to 46,000 people could not have been linked due to death between 2011 and 2016. Similarly, migration data estimates that just over 1.4 million people left Australia as permanent emigrants over the same period, potentially resulting in up to 70,000 people from the 2011 Panel sample being unlikely to have a corresponding 2016 Census record. For more information please refer to the relevant releases of Migration, Australia (cat. no. 3412.0) and Deaths, Australia (cat. no. 3302.0).

Due to the size and complexity of the Census, it is inevitable that some people are missed and some are counted more than once. It is for this reason that the Census Post Enumeration Survey (PES) is run shortly after each Census, to provide an independent measure of Census coverage. The PES determines how many people should have been counted in the Census, how many were missed (undercount), and how many were counted more than once (overcount). It also provides information on the characteristics of those in the population who have been under- or overcounted.

The net undercount rate for the 2016 Census was 1%, with a higher rate for Aboriginal and Torres Strait Islander people than for the non-Indigenous population. Thus approximately 12,000 people from the 2011 Panel sample could have been missed in the 2016 Census. This estimate is a starting point only and does not take into account the likelihood of people being missed in successive Censuses. For more information please refer to Census of Population and Housing: Details of Overcount and Undercount, 2016 (cat. no. 2940.0).

When taking into account all of these factors, it is estimated that approximately 44% of the unlinked 2011 ACLD Panel sample (128,000 out of the 293,000 unlinked records) would not have a corresponding record in the 2016 Census. This would indicate that the initial linkage rate of 76% could be representative of up to 85% of the population that actually had an opportunity to be linked.
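
The arithmetic behind these estimates can be retraced from the rounded figures quoted in this section. The sketch below is only a back-of-envelope check: it treats the three groups as non-overlapping and uses the 5% sampling fraction assumed in the text, and the variable names are illustrative.

```python
PANEL_SAMPLE = 1_221_057
LINKED = 927_520
SAMPLING_FRACTION = 0.05  # as assumed in the text above

deaths = round(913_000 * SAMPLING_FRACTION, -3)       # about 46,000
emigrants = round(1_400_000 * SAMPLING_FRACTION, -3)  # about 70,000
missed_in_census = round(PANEL_SAMPLE * 0.01, -3)     # 1% net undercount, about 12,000

no_2016_record = deaths + emigrants + missed_in_census
unlinked = PANEL_SAMPLE - LINKED

share_of_unlinked = no_2016_record / unlinked
adjusted_linkage_rate = LINKED / (PANEL_SAMPLE - no_2016_record)

print(f"records with no 2016 counterpart: {no_2016_record:,.0f}")
print(f"share of unlinked records: {100 * share_of_unlinked:.0f}%")
print(f"linkage rate among linkable records: {100 * adjusted_linkage_rate:.0f}%")
```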


3.4 WEIGHTING

Weighting is the process of adjusting a sample to infer results for the relevant population. To do this, a 'weight' is allocated to each sample unit - in this case, persons. The weight can be considered an indication of how many people in the relevant population are represented by each person in the sample. Weights were created for linked records in the ACLD to enable longitudinal population estimates to be produced. Cross-sectional population estimates for 2011 and 2016 are available from each Census.

The 2011 Panel of the ACLD is a random sample of 5% of the Australian population in 2011. As such, each person in the sample should represent about 20 people in the population. Between Censuses, however, the in scope population changes as people die or move overseas. In addition, Census net undercount and data quality can affect the capacity to link equivalent records across waves. The weights of the linked records on the ACLD were calibrated to the estimated population that was in scope of both the 2011 and 2016 Censuses, 21,080,214 persons. The weights were based on four components: the design weight, undercoverage adjustment, missed link adjustment and population benchmarking.

The mean final weight for the linked records is 22.3 for females and 23.2 for males. The weights range between 14.8 and 83. The mean weight was higher for Aboriginal and Torres Strait Islander persons and for people in the Northern Territory.
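
The mean weights quoted above can be recovered from Table 4 by dividing the weighted linked results by the number of linked records. The sketch below does this for each sex and, for comparison, shows the design weight implied by a 5% sample; the variable names are illustrative only.

```python
# Linked record counts and weighted linked results, copied from Table 4.
LINKED = {"Male": 450_092, "Female": 477_426}
WEIGHTED = {"Male": 10_440_753, "Female": 10_639_417}

mean_weight = {sex: WEIGHTED[sex] / LINKED[sex] for sex in LINKED}
overall_mean_weight = 21_080_214 / 927_520

# Design weight implied by the stated 5% sampling fraction.
design_weight = 1 / 0.05

for sex, weight in mean_weight.items():
    print(f"{sex:<6} mean weight = {weight:.1f}")
print(f"overall mean weight = {overall_mean_weight:.1f} (design weight {design_weight:.0f})")
```

The gap between the design weight and the mean final weight reflects the undercoverage, missed link and benchmarking adjustments described in this section.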

The population benchmark is based on the 2016 Estimated Resident Population (ERP), which is adjusted by the estimated probability that a person was also in Australia in 2011. This probability is formed using the 2016 Census reported address five years ago variable. Further information on this approach can be found in the paper Chipperfield, Brown & Watson (2016). See References section for details of this publication.

For more information about weighting please refer to the Appendix.